1 Introduction

This page introduces the Neotoma APIs and describes some situations in which calling them directly may be preferable to using the neotoma2 R package.

2 What is an API?

An API (application programming interface) is a set of rules that allows different computers or computer components to communicate. One set of APIs enables a user’s computer programs to access resources managed by the same computer’s operating system. For example, a program might request memory through an API known as a system call.

The APIs we’re concerned with here are Web APIs. That is, they’re APIs that use Web protocols like HTTP to enable communication between computers through the internet. Web APIs are fundamental to web-based information sharing.

2.1 Neotoma APIs

There is a range of Neotoma APIs that can be accessed through this page. A Web API follows the HTTP protocol, which means you issue a request using a method such as GET, POST, or PUT, and you receive a response. Let’s see how we can make these API calls in R. We’ll need to install and load the httr package, which provides functions such as GET() for those HTTP methods, as well as a content() function that helps decode the response we receive.

We’ll make the following API call: https://api.neotomadb.org/v2.0/data/occurrences?taxonname=Canis&limit=10. We want to wrap it in the GET() function, and then decode the contents.

library(httr)  # provides GET() and content()

firstAPI = GET("https://api.neotomadb.org/v2.0/data/occurrences?taxonname=Canis&limit=10")

print(firstAPI)
## Response [https://api.neotomadb.org/v2.0/data/occurrences?taxonname=Canis&limit=10]
##   Date: 2024-12-30 15:12
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 4.46 kB

Status: 200 is a good sign; it means the request succeeded. Now we use content() to get the data. The output of content() has three components: status, data, and message. Most of the time, status will be “success” and message will be “retrieved all tables,” so we mostly care about data, which we store in insides. But if you ever run into issues and need to debug, it can be helpful to check what the status and message are.

insides = content(firstAPI)$data

print(insides[1])
## [[1]]
## [[1]]$occid
## [1] 1729801
## 
## [[1]]$sample
## [[1]]$sample$taxonid
## [1] 25
## 
## [[1]]$sample$taxonname
## [1] "Artemisia"
## 
## [[1]]$sample$value
## [1] 9
## 
## [[1]]$sample$sampleunits
## [1] "NISP"
## 
## 
## [[1]]$age
## [[1]]$age$age
## NULL
## 
## [[1]]$age$ageolder
## NULL
## 
## [[1]]$age$ageyounger
## NULL
## 
## 
## [[1]]$site
## [[1]]$site$datasetid
## [1] 1
## 
## [[1]]$site$siteid
## [1] 1
## 
## [[1]]$site$sitename
## [1] "15/1"
## 
## [[1]]$site$altitude
## [1] 244
## 
## [[1]]$site$location
## [1] "{\"type\":\"Point\",\"crs\":{\"type\":\"name\",\"properties\":{\"name\":\"EPSG:4326\"}},\"coordinates\":[-75.25,55.09167]}"
## 
## [[1]]$site$datasettype
## [1] "pollen surface sample"
## 
## [[1]]$site$database
## [1] "North American Pollen Database"
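When a call misbehaves, it can help to check the status and message components before touching the data. The sketch below wraps that check in a hypothetical helper, check_neotoma(), which is not part of httr or neotoma2. It runs on hand-built lists that mimic the three components described above, with made-up values, so it does not contact the API.

```r
# Hypothetical helper: return the data component of a parsed response,
# or stop with the API's message if the status is not "success".
check_neotoma <- function(parsed) {
  if (!identical(parsed$status, "success")) {
    stop("Neotoma API call failed: ", parsed$message)
  }
  parsed$data
}

# Mock versions of content(firstAPI): same three components, made-up values.
mock_ok <- list(status = "success",
                data = list(list(occid = 1)),
                message = "Retrieved all tables")
mock_bad <- list(status = "failure",
                 data = NULL,
                 message = "Invalid parameter")

# A successful response hands back its data component.
insides_mock <- check_neotoma(mock_ok)

# tryCatch() lets us inspect the error text instead of halting the script.
failure_text <- tryCatch(check_neotoma(mock_bad),
                         error = function(e) conditionMessage(e))
print(failure_text)
```

With a real call you would pass content(firstAPI) instead of the mock lists. httr also offers status_code() and stop_for_status() for checking the HTTP status of the response object itself.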

We just successfully made our first API call! But the format we received is a little hard to work with…

2.2 The JSON format

The Neotoma API, like most Web APIs, returns its responses in JSON (JavaScript Object Notation) format. JSON represents data as objects, in which keys that name a property are assigned values, and as arrays of values. A value might be a number, a string, or null, or it could itself be an object or an array. Here’s a snippet of what JSON format looks like:

{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"siteid":"7","name":"Three Pines Bog","description":"Bog.","altitude":"329","handle":"3PINES","collectionunit":null,"collectionunitid":"7","collectionunittype":"Core","datasetid":"7","datasettype":"pollen"},"geometry":{"type":"Point","coordinates":[-80.11667,47.0]}},{"type":"Feature","properties":{"siteid":"7","name":"Three Pines Bog","description":"Bog.","altitude":"329","handle":"3PINES","collectionunit":null,"collectionunitid":"7","collectionunittype":"Core","datasetid":"7857","datasettype":"geochronologic"},"geometry":{"type":"Point","coordinates":[-80.11667,47.0]}},{"type":"Feature","properties":{"siteid":"10","name":"Site 1 (Cohen unpublished)","description":null,"altitude":"36","handle":"ADC001","collectionunit":null,"collectionunitid":"10","collectionunittype":"Modern","datasetid":"10","datasettype":"pollen surface sample"},"geometry":{"type":"Point","coordinates":[-82.33,30.83]}}

In R, it is natural to represent these JSON arrays as nested lists:

list(
  siteid = 7,
  sitename = "Three Pines Bog",
  sitedescription = "Bog.",
  geography = "{\"type\":\"Point\",\"crs\":{\"type\":\"name\",\"properties\":{\"name\":\"EPSG:4326\"}},\"coordinates\":[-80.11667,47]}",
  altitude = 329,
  collectionunits = list(
    list(handle = "3PINES", datasets = list(list(datasetid = 7, datasettype = "pollen")),
         collectionunit = NULL, collectionunitid = 7, collectionunittype = "Core"),
    list(handle = "3PINES", datasets = list(list(datasetid = 7857, datasettype = "geochronologic")),
         collectionunit = NULL, collectionunitid = 7, collectionunittype = "Core")
  )
)
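To see the JSON-to-list mapping in isolation, here is a minimal sketch using the jsonlite package (installed as a dependency of httr). It parses a shortened, single-feature version of the snippet above, and then parses a geography string of the kind shown in the list, which is itself JSON stored as text. As far as I know, content() performs a similar conversion on our behalf, but that is an assumption about httr internals rather than something this page demonstrates.

```r
library(jsonlite)

# A shortened version of the JSON snippet above (one feature, fewer properties).
json_text <- '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"siteid":"7","name":"Three Pines Bog","altitude":"329","collectionunit":null},"geometry":{"type":"Point","coordinates":[-80.11667,47.0]}}]}'

# simplifyVector = FALSE keeps every JSON object and array as a plain R list.
parsed <- fromJSON(json_text, simplifyVector = FALSE)
site_name <- parsed$features[[1]]$properties$name
unit_is_null <- is.null(parsed$features[[1]]$properties$collectionunit)  # JSON null becomes NULL

# The geography field is a JSON string nested inside the response,
# so it needs a second parsing step to reach the coordinates.
geography <- '{"type":"Point","crs":{"type":"name","properties":{"name":"EPSG:4326"}},"coordinates":[-80.11667,47]}'
coords <- fromJSON(geography)$coordinates  # numeric vector: longitude, latitude

print(site_name)
print(coords)
```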

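However, an API response is often easier to inspect as a table than as a list, and the conversion requires some looping. The code below flattens api_sites4, the list of sites downloaded in the call time comparison in Section 3, into one row per collection unit.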

library(DT)  # provides datatable()

# Count the rows needed: one per collection unit across all sites.
counter = 0
for (i in seq_along(api_sites4)) {
  counter = counter + length(api_sites4[[i]]$collectionunits)
}

# Fill the matrix row by row; a field that is NULL leaves its cell as NA.
site_mat = matrix(nrow = counter, ncol = 11)
idx = 0
for (i in seq_along(api_sites4)) {
  site = api_sites4[[i]]
  for (j in seq_along(site$collectionunits)) {
    cu = site$collectionunits[[j]]
    ds = cu$datasets[[1]]  # only the first dataset of each collection unit is tabulated
    idx = idx + 1
    if (!is.null(site$siteid)) site_mat[[idx, 1]] = site$siteid
    if (!is.null(site$sitename)) site_mat[[idx, 2]] = site$sitename
    if (!is.null(site$sitedescription)) site_mat[[idx, 3]] = site$sitedescription
    if (!is.null(site$geography)) site_mat[[idx, 4]] = site$geography
    if (!is.null(site$altitude)) site_mat[[idx, 5]] = site$altitude
    if (!is.null(cu$handle)) site_mat[[idx, 6]] = cu$handle
    if (!is.null(cu$collectionunit)) site_mat[[idx, 7]] = cu$collectionunit
    if (!is.null(cu$collectionunitid)) site_mat[[idx, 8]] = cu$collectionunitid
    if (!is.null(cu$collectionunittype)) site_mat[[idx, 9]] = cu$collectionunittype
    if (!is.null(ds$datasetid)) site_mat[[idx, 10]] = ds$datasetid
    if (!is.null(ds$datasettype)) site_mat[[idx, 11]] = ds$datasettype
  }
}

site_df = as.data.frame(site_mat)
names(site_df) = c("siteid", "name", "description", "geography", "altitude", "handle",
                   "collectionunit", "collectionunitid", "collectionunittype",
                   "datasetid", "datasettype")

datatable(site_df, rownames = FALSE)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
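The same flattening can be written without pre-counting rows or tracking column positions. The sketch below is one possible alternative, not the method used on this page: it builds one small data frame per collection unit and stacks them with rbind(). It runs on a mock list with the same nesting as the example list in Section 2.2 (values copied from that example), and the helper null_to_na() is hypothetical.

```r
# Hypothetical helper: replace NULL with NA so every field yields exactly one value.
null_to_na <- function(x) if (is.null(x)) NA else x

# Mock input: one site with two collection units, nested like the API response.
sites <- list(
  list(
    siteid = 7, sitename = "Three Pines Bog", altitude = 329,
    collectionunits = list(
      list(handle = "3PINES", collectionunit = NULL, collectionunitid = 7,
           collectionunittype = "Core",
           datasets = list(list(datasetid = 7, datasettype = "pollen"))),
      list(handle = "3PINES", collectionunit = NULL, collectionunitid = 7,
           collectionunittype = "Core",
           datasets = list(list(datasetid = 7857, datasettype = "geochronologic")))
    )
  )
)

rows <- list()
for (site in sites) {
  for (cu in site$collectionunits) {
    # Use the first dataset if there is one; an empty list yields NA fields.
    ds <- if (length(cu$datasets) > 0) cu$datasets[[1]] else list()
    rows[[length(rows) + 1]] <- data.frame(
      siteid = null_to_na(site$siteid),
      name = null_to_na(site$sitename),
      altitude = null_to_na(site$altitude),
      handle = null_to_na(cu$handle),
      collectionunit = null_to_na(cu$collectionunit),
      collectionunittype = null_to_na(cu$collectionunittype),
      datasetid = null_to_na(ds$datasetid),
      datasettype = null_to_na(ds$datasettype),
      stringsAsFactors = FALSE
    )
  }
}

# Stack the one-row data frames into a single table.
site_df_sketch <- do.call(rbind, rows)
print(site_df_sketch)
```

Because columns are matched by name rather than by position, adding or dropping a field means editing one line. Growing a list row by row is slower than filling a preallocated matrix, so for many thousands of sites the matrix approach may still be the better choice.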

2.3 The Neotoma2 R package and the API

You may not have used an API explicitly before, but if you’ve used the neotoma2 R package, then you’ve already used one implicitly. All neotoma2 functions at some point call the helper function neotoma2::parseURL(). This function is somewhat long, but we can use grep() to search through its source for any mention of an API. When we do, we get the following result:

baseurl <- switch(use, dev = "https://api-dev.neotomadb.org/v2.0/", ,         neotoma = "https://api.neotomadb.org/v2.0/", local = "http://localhost:3005/v2.0/", 

In other words, any neotoma2 function ultimately uses the Neotoma API to communicate with the database. So why would we ever want to use the APIs directly, rather than going through the neotoma2 R package? There are at least two reasons. First, it is much faster to use the API directly to download large amounts of data. Second, some Neotoma metadata are only available through the API, not the R package.

Let’s examine each of these reasons in order.

3 Call time comparison

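To compare call times, we request the sites inside five progressively larger bounding boxes, once with neotoma2::get_sites() and once with a direct API call, and time each request with system.time().

library(httr)       # GET(), content()
library(dplyr)      # %>%, summarise()
library(sf)         # st_as_sf(), st_combine(), st_cast()
library(geojsonsf)  # sf_geojson()

# First bounding box: latitude 43 to 50, longitude -65 to -60.
lats = c(43, 50, 50, 43)
lons = c(-65, -65, -60, -60)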

coordinates = data.frame(lat = lats, lon = lons)

coordinates_sf = coordinates %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

bbox_geojson = sf_geojson(coordinates_sf)

R_getsites_time = system.time(neotoma2::get_sites(loc = bbox_geojson, all_data = TRUE))

api_sites = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson,"&limit=9999&offset=0")))$data

api_getsites_time = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson,"&limit=99999&offset=0")))$data)

print(R_getsites_time)
##    user  system elapsed 
##    0.84    0.03    5.58
print(api_getsites_time)
##    user  system elapsed 
##    0.03    0.00    0.72
print(length(api_sites))
## [1] 153

# Second bounding box: latitude 43 to 50, longitude -70 to -60.
lats1 = c(43, 50, 50, 43)
lons1 = c(-70, -70, -60, -60)

coordinates1 = data.frame(lat = lats1, lon = lons1)

coordinates1_sf = coordinates1 %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

bbox_geojson1 = sf_geojson(coordinates1_sf)

R_getsites_time1 = system.time(neotoma2::get_sites(loc = bbox_geojson1, all_data = TRUE))

api_sites1 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson1,"&limit=9999&offset=0")))$data

api_getsites_time1 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson1,"&limit=99999&offset=0")))$data)


print(R_getsites_time1)
##    user  system elapsed 
##    2.75    0.14   14.23
print(api_getsites_time1)
##    user  system elapsed 
##    0.04    0.00    1.05
print(length(api_sites1))
## [1] 479

# Third bounding box: latitude 33 to 50, longitude -75 to -60.
lats2 = c(33, 50, 50, 33)
lons2 = c(-75, -75, -60, -60)

coordinates2 = data.frame(lat = lats2, lon = lons2)

coordinates2_sf = coordinates2 %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

bbox_geojson2 = sf_geojson(coordinates2_sf)

R_getsites_time2 = system.time(neotoma2::get_sites(loc = bbox_geojson2, all_data = TRUE))

api_sites2 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson2,"&limit=9999&offset=0")))$data

api_getsites_time2 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson2,"&limit=99999&offset=0")))$data)


print(R_getsites_time2)
##    user  system elapsed 
##   10.06    0.27   61.11
print(api_getsites_time2)
##    user  system elapsed 
##    0.17    0.00    1.72
print(length(api_sites2))
## [1] 1664

# Fourth bounding box: latitude 23 to 50, longitude -80 to -60.
lats3 = c(23, 50, 50, 23)
lons3 = c(-80, -80, -60, -60)

coordinates3 = data.frame(lat = lats3, lon = lons3)

coordinates3_sf = coordinates3 %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

bbox_geojson3 = sf_geojson(coordinates3_sf)

R_getsites_time3 = system.time(neotoma2::get_sites(loc = bbox_geojson3, all_data = TRUE))

api_sites3 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson3,"&limit=9999&offset=0")))$data

api_getsites_time3 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson3,"&limit=99999&offset=0")))$data)


print(R_getsites_time3)
##    user  system elapsed 
##   18.58    0.76  133.85
print(api_getsites_time3)
##    user  system elapsed 
##    0.34    0.01    2.31
print(length(api_sites3))
## [1] 3205

# Fifth bounding box: latitude 23 to 50, longitude -90 to -60.
lats4 = c(23, 50, 50, 23)
lons4 = c(-90, -90, -60, -60) # Reordered for a rectangle

coordinates4 = data.frame(lat = lats4, lon = lons4)

coordinates4_sf = coordinates4 %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

bbox_geojson4 = sf_geojson(coordinates4_sf)

R_getsites_time4 = system.time(neotoma2::get_sites(loc = bbox_geojson4, all_data = TRUE))
## Warning in .f(.x[[i]], ...): Dataset(s) 25582, 25583, 6448 may have been recently removed from the database. Affected sites/datasets will be removed when you do `get_datasets` or `get_downloads`
api_sites4 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson4,"&limit=9999&offset=0")))$data

api_getsites_time4 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",bbox_geojson4,"&limit=99999&offset=0")))$data)



print(R_getsites_time4)
##    user  system elapsed 
##   32.84    0.90  396.91
print(api_getsites_time4)
##    user  system elapsed 
##    0.59    0.01    4.05
print(length(api_sites4))
## [1] 6231
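
library(tmap)  # tm_shape(), tm_rgb(), tm_borders()
library(rosm)  # osm.raster()

# Map the five nested bounding boxes over an OpenStreetMap basemap.
tm_shape(osm.raster(coordinates4_sf)) + tm_rgb() +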
  tm_shape(coordinates4_sf) + tm_borders(col="red") +
  tm_shape(coordinates3_sf) + tm_borders(col="blue") +
  tm_shape(coordinates2_sf) + tm_borders(col="black") +
  tm_shape(coordinates1_sf) + tm_borders(col="green") +
  tm_shape(coordinates_sf) + tm_borders(col="white")
## Zoom: 5
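The five timing blocks above repeat the same steps with different corner coordinates. As a sketch of how that repetition could be folded into functions, the code below builds the GeoJSON polygon and the request URL for each box in base R, without sf or geojsonsf. The helper names bbox_to_geojson() and sites_url() are hypothetical, and it is an assumption that the API treats this hand-built string the same way it treats the output of sf_geojson(). Nothing here contacts the API.

```r
# Hypothetical helper: GeoJSON Polygon string for a latitude/longitude rectangle.
# The ring repeats its first corner at the end, since GeoJSON polygon rings must be closed.
bbox_to_geojson <- function(lat_min, lat_max, lon_min, lon_max) {
  lon <- c(lon_min, lon_min, lon_max, lon_max, lon_min)
  lat <- c(lat_min, lat_max, lat_max, lat_min, lat_min)
  ring <- paste0("[", lon, ",", lat, "]", collapse = ",")
  paste0('{"type":"Polygon","coordinates":[[', ring, ']]}')
}

# Hypothetical helper: assemble the sites URL in the same way as the calls above.
sites_url <- function(geojson, limit = 9999, offset = 0) {
  paste0("https://api.neotomadb.org/v2.0/data/sites?loc=", geojson,
         "&limit=", limit, "&offset=", offset)
}

# The five boxes used on this page: lat_min, lat_max, lon_min, lon_max.
boxes <- list(
  c(43, 50, -65, -60),
  c(43, 50, -70, -60),
  c(33, 50, -75, -60),
  c(23, 50, -80, -60),
  c(23, 50, -90, -60)
)

urls <- vapply(
  boxes,
  function(b) sites_url(bbox_to_geojson(b[1], b[2], b[3], b[4])),
  character(1)
)
print(urls[1])

# With httr loaded and a network connection, each URL could then be timed as above:
# system.time(content(GET(urls[1]))$data)
```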

The plot below shows how the R package (red points) gets slower and slower as the number of sites requested grows, while the direct API call (blue points) takes only a few seconds throughout. We added a few points that aren’t displayed above (the objects suffixed a through g) just to fill out the curve without adding too much clutter to the page.

library(ggplot2)  # ggplot(), geom_point(), theme_bw()

# Element 3 of a system.time() result is the elapsed (wall-clock) time in seconds.
# The objects suffixed a through g come from additional runs not shown on this page.
Rtimes = c(R_getsites_time[[3]], R_getsites_time1[[3]], R_getsites_time2[[3]],
           R_getsites_time3[[3]], R_getsites_time4[[3]], R_getsites_timea[[3]],
           R_getsites_timeb[[3]], R_getsites_timec[[3]], R_getsites_timed[[3]],
           R_getsites_timee[[3]], R_getsites_timef[[3]], R_getsites_timeg[[3]])

apitimes = c(api_getsites_time[[3]], api_getsites_time1[[3]], api_getsites_time2[[3]],
             api_getsites_time3[[3]], api_getsites_time4[[3]], api_getsites_timea[[3]],
             api_getsites_timeb[[3]], api_getsites_timec[[3]], api_getsites_timed[[3]],
             api_getsites_timee[[3]], api_getsites_timef[[3]], api_getsites_timeg[[3]])

site_num = c(length(api_sites), length(api_sites1), length(api_sites2),
             length(api_sites3), length(api_sites4), length(api_sitesa),
             length(api_sitesb), length(api_sitesc), length(api_sitesd),
             length(api_sitese), length(api_sitesf), length(api_sitesg))

time_df = data.frame(Rt = Rtimes, api_t = apitimes, sites = site_num)

# Red points: neotoma2::get_sites(). Blue points: direct API call.
ggplot(time_df) +
  geom_point(mapping = aes(x = sites, y = Rt), color = "red", alpha = 0.7) +
  geom_point(mapping = aes(x = sites, y = api_t), color = "blue", alpha = 0.7) +
  theme_bw() +
  scale_y_continuous(name = "time (seconds)") +
  scale_x_continuous(name = "number of sites")
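To put numbers on the gap, the short calculation below takes the five elapsed times and site counts printed earlier in this section and computes how many times faster the direct API call was for each box. The figures are copied from the outputs above; timings depend on the network and on server load, so a rerun will give different values.

```r
# Elapsed times (seconds) and site counts copied from the printed outputs above.
timings <- data.frame(
  sites       = c(153, 479, 1664, 3205, 6231),
  r_elapsed   = c(5.58, 14.23, 61.11, 133.85, 396.91),
  api_elapsed = c(0.72, 1.05, 1.72, 2.31, 4.05)
)

# How many times longer the R package call took than the direct API call.
timings$speedup <- timings$r_elapsed / timings$api_elapsed

# Time per site for the R package, in milliseconds.
timings$r_ms_per_site <- 1000 * timings$r_elapsed / timings$sites

print(round(timings, 2))
```

For these five runs the ratio grows with every step, from just under 8 for the smallest box to about 98 for the largest.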

4 Access More Metadata

Neotoma contains extensive metadata, not all of which are exposed through the R package. The API is useful for gathering these metadata. For example, the image below (from the Neotoma open schema) shows some of the metadata tables linked to sites:

In order to grab, say, the siteimages table, we need to use a particularly versatile API call, “/v2.0/data/dbtables/”, that downloads whichever table name you supply. When we look at our results, we see nine image locations, all associated with the Illinois State Museum. Unfortunately, if you actually try to access these photos, you’ll find that the linked location is no longer connected to an image.

siteimages = content(GET("https://api.neotomadb.org/v2.0/data/dbtables/siteimages?count=false&limit=9999&offset=0"))$data


# One row per image record, one column per field; a NULL field leaves its cell as NA.
image_mat = matrix(nrow = length(siteimages), ncol = 9)

for (i in seq_along(siteimages)) {
  for (j in seq(9)) {
    if (!is.null(siteimages[[i]][[j]])) {
      image_mat[[i, j]] = siteimages[[i]][[j]]
    }
  }
}

image_df = as.data.frame(image_mat)
names(image_df) = c("siteimageid","siteid","contactid","caption","credit","date","siteimage","recdatecreated","recdatemodified")

datatable(image_df,rownames=FALSE)
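Since the dbtables call can return any table, a reusable converter may be handier than a loop written for one table. The sketch below shows one way to do that, with a hypothetical helper records_to_df() that looks fields up by name instead of by position. It runs on mock records with made-up values, and it assumes the API labels each field with the column names used above, which this page does not confirm.

```r
# Hypothetical helper: convert a list of flat records into a data frame,
# looking up each requested field by name and using NA where it is NULL or absent.
records_to_df <- function(records, fields) {
  rows <- lapply(records, function(rec) {
    values <- lapply(fields, function(f) if (is.null(rec[[f]])) NA else rec[[f]])
    names(values) <- fields
    as.data.frame(values, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Mock records (made-up values) using three of the column names listed above.
mock_images <- list(
  list(siteimageid = 1, siteid = 10, caption = "Example caption"),
  list(siteimageid = 2, siteid = 11, caption = NULL)
)

image_df_sketch <- records_to_df(mock_images, c("siteimageid", "siteid", "caption"))
print(image_df_sketch)
```

With a real response, the call would be records_to_df(siteimages, fields), where fields holds the nine column names, provided the name assumption holds.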